• Hadoop Website Analyst
• Data Processing Platform
• Data Analysis & Preprocessing
• Mapper Execution Program
• Reducer Execution Program
• Processing the Full Dataset on the Hadoop Cluster
• References
Hadoop Website Analyst
The task is to analyze website access data and, for each day, rank the visited URLs by hit count from highest to lowest.
Data Processing Platform
The analysis uses Hadoop 3.4.1, deployed in pseudo-distributed mode inside a Docker container.
Data Analysis & Preprocessing
Opening the comma-separated values (CSV) dataset, a short excerpt looks like this:
____________________________________________________________________________________________________________________________________________________
| o o o csv |
|====================================================================================================================================================|
| 5813192,1688,109495,A8469148F4D543D74ACED3F4A2A115EC,2020/12/31 11:57,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html |
| 5813194,1688,109495,A8469148F4D543D74ACED3F4A2A115EC,2020/12/31 11:57,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106 |
| 5813195,1688,109126,B02879C663DFBB315CD8E357C460F8B1,2020/12/31 11:57,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106 |
| 5813174,1628,90893,1BBDF304E8183F83A706A2AA43C3A9C0,2020/12/31 11:55,/tzjingsai/1628.jhtml |
| 5813175,1628,90893,1BBDF304E8183F83A706A2AA43C3A9C0,2020/12/31 11:55,http://www.tipdm.org/tzjingsai/1628.jhtml |
| 5813176,1628,90893,1BBDF304E8183F83A706A2AA43C3A9C0,2020/12/31 11:55,http://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html?cName=ral_106 |
| 5813148,1693,85440,0DDFAC4CA65D24D9726B9D765ECB504E,2020/12/31 11:48,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_105 |
| 5813149,1693,85440,0DDFAC4CA65D24D9726B9D765ECB504E,2020/12/31 11:48,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_106 |
| 5813147,1693,85440,0DDFAC4CA65D24D9726B9D765ECB504E,2020/12/31 11:47,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_106 |
| 5813116,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:43,/tzjingsai/1628.jhtml |
| 5813117,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:43,http://www.tipdm.org/tzjingsai/1628.jhtml |
| 5813119,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:43,http://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html?cName=ral_106 |
| 5813094,1692,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:40,https://www.tipdm.org/bdrace/news/20200908/1692.html |
| 5813095,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:40,/tzjingsai/1628.jhtml |
| 5813096,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:40,http://www.tipdm.org/tzjingsai/1628.jhtml |
| 5813099,1628,90893,36FFBD3CC7580CC4C046164CF5648909,2020/12/31 11:40,http://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html?cName=ral_104 |
| 5813048,1688,104842,F77914C7D45AEAEEC0FC02F7FF19D03C,2020/12/31 11:32,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106 |
| 5813049,1688,104842,F77914C7D45AEAEEC0FC02F7FF19D03C,2020/12/31 11:32,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html |
| 5813061,1688,104842,F77914C7D45AEAEEC0FC02F7FF19D03C,2020/12/31 11:32,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106 |
| 5813034,1693,87278,A59EA59CA11FADE6461F216A61AD6716,2020/12/31 11:29,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_105 |
| 5813036,1693,87278,A59EA59CA11FADE6461F216A61AD6716,2020/12/31 11:29,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_106 |
| 5813038,1693,87278,A59EA59CA11FADE6461F216A61AD6716,2020/12/31 11:29,https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html?cName=ral_105 |
| 5813010,1688,104842,F77914C7D45AEAEEC0FC02F7FF19D03C,2020/12/31 11:21,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_104 |
| 5813012,1688,104842,F77914C7D45AEAEEC0FC02F7FF19D03C,2020/12/31 11:21,https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106 |
'===================================================================================================================================================='
A few features can be read off this sample:
• The date is in column 5.
• The URL is in column 6.
• Some URLs contain a bdrace/ directory and some do not; this is unrelated to whether the scheme is HTTP or HTTPS.
• The dataset contains incomplete URLs (no scheme or host); those pages are all implemented in JHTML.
• Column 2 matches the number in the HTML page's file name.
• Entries that hit the same HTML page within a visit share identical values in columns 2 through 4, which look like identifiers of some kind.
• Column 1 appears to be the record number.
To rank website visits by day, only columns 5 and 6 matter: the date and the URL. To keep the results faithful, the URLs are analyzed as-is; apart from handling missing fields and stripping the query string and fragment from each URL, no other normalization is applied.
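As a quick illustration of this preprocessing (a standalone sketch using one row from the excerpt above), the column extraction and URL cleanup look like this:

import re

row = ('5813194,1688,109495,A8469148F4D543D74ACED3F4A2A115EC,'
       '2020/12/31 11:57,'
       'https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html?cName=ral_106')
fields = row.split(',')
date = fields[4].split(' ')[0]                 # column 5 -> '2020/12/31'
url = re.sub(r'(?<=html).*$', '', fields[5])   # drop the query string/fragment
print(f'{date}\t{url}')
# 2020/12/31	https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html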
Both programs run under the Hadoop Streaming utility, which makes the processing framework independent of the implementation language. The Mapper Execution program (hereafter, the Mapper) and the Reducer Execution program (hereafter, the Reducer) below are both implemented in Python:
Mapper Execution Program
#!/usr/bin/env python3
import sys
import re

for line in sys.stdin:
    fields = line.strip().split(',')
    try:
        # extract the date part of the timestamp in column 5
        time = fields[4].split(' ')[0]
        url = fields[5]
    except IndexError:
        # skip incomplete records
        continue
    # the shuffle phase sorts keys lexicographically, so zero-pad month and day
    time = '/'.join(f'{s:0>2}' for s in time.split('/'))
    # strip the query string and fragment (surprisingly, the dataset uses a
    # full-width '＃' in places)
    url = re.sub(r'(?<=html).*$', '', url)
    print(f'{time}\t{url}')
The Mapper reads CSV records from standard input and writes records of the form date <TAB> URL to standard output.
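The zero-padding matters because the shuffle phase compares keys as plain strings; a quick check in a Python REPL shows why:

>>> sorted(['2020/12/31', '2020/5/13'])    # unpadded: December sorts before May
['2020/12/31', '2020/5/13']
>>> sorted(['2020/12/31', '2020/05/13'])   # zero-padded: chronological order
['2020/05/13', '2020/12/31']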
Testing the Mapper program:
cat test.csv | ./mapexe.py
An excerpt of the output:
_________________________________________________________________________________
| o o o bash |
|=================================================================================|
| 2020/12/21 https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html |
| 2020/12/21 https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html |
| 2020/12/21 https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html |
| 2020/12/21 https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html |
| 2020/12/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
| 2020/12/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
| 2020/12/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
| 2020/12/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
| 2020/12/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
| 2020/12/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
| 2020/12/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
| 2020/12/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
| 2020/12/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
| 2020/12/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
| 2020/12/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
| 2020/01/21 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html |
'================================================================================='
Reducer Execution Program
The Reducer Execution program (hereafter, the Reducer) reads date-sorted tab-separated values (TSV) from standard input. Each value, i.e. a URL grouped under its date key by Hadoop's shuffle phase, is appended to a per-date list, and the URLs in each list are then ranked by visit count. Finally, records of the form date (ascending) <TAB> URL and its visit count (descending) are written to standard output.
#!/usr/bin/env python3
import sys
from collections import defaultdict, Counter

# collect the URLs seen on each date
url_list = defaultdict(list)
for line in sys.stdin:
    k, v = line.strip().split('\t', 1)
    url_list[k].append(v)

# for each date, count the URLs and emit them in descending order of visits
for time, urls in url_list.items():
    for url, count in Counter(urls).most_common():
        print(time, url, count, sep='\t')
Simulating the full job locally, with sort standing in for Hadoop's shuffle phase:
cat test.csv | ./mapexe.py | sort | ./reducexe.py
An excerpt of the output:
_____________________________________________________________________________________
| o o o bash |
|=====================================================================================|
| 2020/12/31 https://www.tipdm.org/bdrace/jljingsai/20200903/1688.html 124 |
| 2020/12/31 https://www.tipdm.org/bdrace/dwqbygajrsxjmjs/20200916/1693.html 117 |
| 2020/12/31 https://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html 20 |
| 2020/12/31 https://www.tipdm.org/bdrace/wq1jszx/20201222/1731.html 11 |
| 2020/12/31 /tzjingsai/1628.jhtml 9 |
| 2020/12/31 http://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html 9 |
| 2020/12/31 http://www.tipdm.org/tzjingsai/1628.jhtml 8 |
| 2020/12/31 https://www.tipdm.org/bdrace/jsgsmcm/20190402/1564.html 8 |
| 2020/12/31 https://www.tipdm.org/bdrace/jn3jszx/20201223/1732.html 7 |
| 2020/12/31 https://www.tipdm.org/bdrace/tabyxlw/20201202/1727.html 4 |
| 2020/12/31 http://www.tipdm.org/bdrace/jljingsai/20190809/1605.html 3 |
| 2020/12/31 /tj/661.jhtml 2 |
| 2020/12/31 http://www.tipdm.org/bdrace/jsgsmcm/20190402/1564.html 2 |
| 2020/12/31 http://www.tipdm.org/tj/1266.jhtml 2 |
| 2020/12/31 http://www.tipdm.org/tj/535.jhtml 2 |
| 2020/12/31 http://www.tipdm.org/tj/578.jhtml 2 |
| 2020/12/31 http://www.tipdm.org/ts/661.jhtml 2 |
'====================================================================================='
As the output shows, the program ranks each day's page visits in descending order, with the days themselves in date order.
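As an aside: because the shuffle phase delivers keys to the Reducer already sorted, each date's records arrive contiguously, so a Reducer only ever needs to hold one day's counts in memory. A minimal sketch of that variant (not the version used above):

#!/usr/bin/env python3
# Sketch: a reducer that buffers only the current day's counts, relying on
# the shuffle phase delivering keys in sorted order.
import sys
from collections import Counter

def flush(day, counts):
    for url, count in counts.most_common():
        print(day, url, count, sep='\t')

current_day, counts = None, Counter()
for line in sys.stdin:
    day, url = line.rstrip('\n').split('\t', 1)
    if day != current_day:
        if current_day is not None:
            flush(current_day, counts)
        current_day, counts = day, Counter()
    counts[url] += 1
if current_day is not None:
    flush(current_day, counts)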
Processing the Full Dataset on the Hadoop Cluster
The project directory layout:
__________________________
| o o o bash |
|==========================|
| . |
| ├── init.sh |
| └── site_visitors |
| ├── mapexe.py |
| ├── reducexe.py |
| ├── site_visitors.sh |
| ├── test.csv |
| └── visitors.csv |
'=========================='
The initialization script init.sh is shown below. It generates the Hadoop launch script in the target directory and transfers the data files into the container:
#!/bin/bash
project_name=site_visitors
job_file=visitors.csv
cat <<-SCRIPT > ${project_name}/${project_name}.sh
#!/bin/bash -x
# create folder
hdfs dfs -mkdir -p input_${project_name}
hdfs dfs -put -f ${job_file}
# start job
mapred streaming \
-input ${job_file} \
-output output_${project_name} \
-mapper mapexe.py \
-reducer reducexe.py \
-file mapexe.py \
-file reducexe.py
SCRIPT
# transmit file
sudo docker cp ${project_name} hadoop_single_node:/home/singlenode/
sudo docker exec hadoop_single_node sudo chown -R singlenode:singlenode ${project_name}
sudo docker exec hadoop_single_node sudo chmod -R +x ${project_name}/*.py ${project_name}/*.sh
Start the Docker container, change into the target directory, and run the script to create the Hadoop job. Part of the output:
______________________________________________________________________________________________________________________________________________
| o o o bash |
|==============================================================================================================================================|
| singlenode@singlenode:~/site_visitors$ ls |
| mapexe.py reducexe.py site_visitors.sh test.csv visitors.csv |
| singlenode@singlenode:~/site_visitors$ ./site_visitors.sh |
| + hdfs dfs -mkdir -p input_site_visitors |
| + hdfs dfs -put -f visitors.csv |
| + mapred streaming -input visitors.csv -output output_site_visitors -mapper mapexe.py -reducer reducexe.py -file mapexe.py -file reducexe.py |
| 2024-12-10 05:10:21,662 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead. |
| packageJobJar: [mapexe.py, reducexe.py] [] /tmp/streamjob14897470953513239277.jar tmpDir=null |
| 2024-12-10 05:10:22,763 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties |
| 2024-12-10 05:10:22,900 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). |
| 2024-12-10 05:10:22,901 INFO impl.MetricsSystemImpl: JobTracker metrics system started |
| 2024-12-10 05:10:22,918 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized! |
| 2024-12-10 05:10:23,175 INFO mapred.FileInputFormat: Total input files to process : 1 |
| 2024-12-10 05:10:23,251 INFO mapreduce.JobSubmitter: number of splits:1 |
| 2024-12-10 05:10:23,447 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1924643391_0001 |
| 2024-12-10 05:10:23,448 INFO mapreduce.JobSubmitter: Executing with tokens: [] |
| |
| [...] |
'=============================================================================================================================================='
Inspecting the job output files on HDFS shows that the results match the local test, only computed in a distributed fashion this time:
__________________________________________________________________________________________________
| o o o bash |
|==================================================================================================|
| singlenode@singlenode:~/site_visitors$ hdfs dfs -ls output_site_visitors |
| Found 2 items |
| -rw-r--r-- 3 singlenode supergroup 0 2024-12-10 05:10 output_site_visitors/_SUCCESS |
| -rw-r--r-- 3 singlenode supergroup 992232 2024-12-10 05:10 output_site_visitors/part-00000 |
| singlenode@singlenode:~/site_visitors$ hdfs dfs -head output_site_visitors/part-00000 |
| 2020/05/13 http://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html 413 |
| 2020/05/13 /tzjingsai/1628.jhtml 61 |
| 2020/05/13 http://www.tipdm.org/tzjingsai/1628.jhtml 57 |
| 2020/05/13 https://www.tipdm.org/bdrace/tzjingsai/20200113/1628.html 39 |
| 2020/05/13 http://www.tipdm.org/bdrace/tzbstysj/20200228/1637.html 31 |
| 2020/05/13 http://www.tipdm.org/tj/1615.jhtml 26 |
| 2020/05/13 http://www.tipdm.org/bdrace/tzjingsai/20181226/1544.html 25 |
| 2020/05/13 http://www.tipdm.org/bdrace/tzbszjs/20200203/1632.html 23 |
| 2020/05/13 http://www.tipdm.org/tj/661.jhtml 20 |
| 2020/05/13 http://www.tipdm.org/bdrace/tzqhjmd/20190604/1583.html 15 |
| 2020/05/13 http://www.tipdm.org/bdrace/jljingsai/20190809/1605.html 13 |
| 2020/05/13 /tj/1615.jhtml 13 |
| 2020/05/13 http://www.tipdm.org/bdrace/tzbstysj/20200410/1640.html 12 |
| 2020/05/13 http://www.tipdm.org/bdrace/tzbjszx/20200203/1639.html 12 |
| 2020/05/13 http://www.tipdm.org/ts/661.jhtml 12 |
| 2020/05/13 /tj/661.jhtml 10 |
| 2020/05/13 http://www.tipdm.org/bdrace/jljingsai/20181008/1488.html 6 |
'=================================================================================================='
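As a quick sanity check (a sketch; it assumes the hdfs CLI is on the PATH and the output path shown above), a short script can verify that counts never increase within a day:

#!/usr/bin/env python3
# Stream the job output from HDFS and assert that, within each day,
# the visit counts are non-increasing.
import subprocess
from collections import defaultdict

proc = subprocess.Popen(
    ['hdfs', 'dfs', '-cat', 'output_site_visitors/part-00000'],
    stdout=subprocess.PIPE, text=True)
last = defaultdict(lambda: float('inf'))
for line in proc.stdout:
    day, url, count = line.rstrip('\n').split('\t')
    assert int(count) <= last[day], f'ordering violated on {day}'
    last[day] = int(count)
print('per-day ordering OK')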
References
• Apache Hadoop 3.4.1 – Hadoop: Setting up a Single Node Cluster
• Apache Hadoop 3.4.1 – HDFS Commands Guide
• Apache Hadoop MapReduce Streaming – Hadoop Streaming
Create: Thu Dec 12 21:49:50 2024
Last Modified: Thu Dec 12 21:49:50 2024